Abstract

We consider high dimensional Wishart matrices $\mathbb{X}\mathbb{X}^\top$ where the entries of $\mathbb{X} \in \mathbb{R}^{n \times d}$ are
i.i.d. from a log-concave distribution. We prove an information theoretic phase transition:
such matrices are close in total variation distance to the corresponding Gaussian ensemble if
and only if $d$ is much larger than $n^3$. Our proof is entropy-based, making use of the chain rule
for relative entropy along with the recursive structure in the definition of the Wishart ensemble. 
The proof crucially relies on the well known relation between Fisher information and entropy, 
a variational representation for Fisher information, concentration bounds for the spectral norm 
of a random matrix, and certain small ball probability estimates for log-concave measures. 


1 Introduction 

Let $\mu$ be a probability distribution supported on $\mathbb{R}$ with zero mean and unit variance. We consider
a Wishart matrix (with removed diagonal) $W = (\mathbb{X}\mathbb{X}^\top - \mathrm{diag}(\mathbb{X}\mathbb{X}^\top)) / \sqrt{d}$ where $\mathbb{X}$ is an $n \times d$
random matrix with i.i.d. entries from $\mu$. The distribution of $W$, which we denote $\mathcal{W}_{n,d}(\mu)$, is
of importance in many areas of mathematics. Perhaps most prominently it arises in statistics as
the distribution of covariance matrices, and in this case $n$ can be thought of as the number of
parameters and $d$ as the sample size. Another application is in the theory of random graphs where
the thresholded matrix $A_{i,j} = \mathbb{1}\{W_{i,j} > \tau\}$ is the adjacency matrix of a random geometric graph
on $n$ vertices, where each vertex is associated to a latent feature vector in $\mathbb{R}^d$ (namely the $i^{th}$
row of $\mathbb{X}$), and an edge is present between two vertices if the correlation between the underlying
features is large enough. Wishart matrices also appear in physics, as a simple model of a random 
mixed quantum state where n and d are the dimensions of the observable and unobservable states 
respectively. 

The measure $\mathcal{W}_{n,d}(\mu)$ becomes approximately Gaussian when $d$ goes to infinity and $n$ remains
bounded (see Section 1.1). Thus in the classical regime of statistics where the sample size is
much larger than the number of parameters one can use the well understood theory of Gaussian
matrices to study the properties of $\mathcal{W}_{n,d}(\mu)$. In this paper we investigate the extent to which this
Gaussian picture remains relevant in the high-dimensional regime where the matrix size $n$ also goes
to infinity. Our main result, stated informally, is the following universality of a critical dimension
for sufficiently smooth measures $\mu$ (namely log-concave): the Wishart measure $\mathcal{W}_{n,d}(\mu)$ becomes
approximately Gaussian if and only if $d$ is much larger than $n^3$. From a statistical perspective this
means that analyses based on Gaussian approximation of a Wishart are valid as long as the number
of samples is at least the cube of the number of parameters. In the random graph setting this gives
a dimension barrier to the extraction of geometric information from a network, as our result shows
that all geometry is lost when the dimension of the latent feature space is larger than the cube of
the number of vertices.
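To make the objects above concrete, here is a minimal NumPy sketch of how one might sample the diagonal-removed Wishart matrix and the associated thresholded graph. The choice of Laplace entries (one example of a log-concave law, rescaled to unit variance), the function names, and the threshold value are illustrative assumptions, not part of the paper.

```python
import numpy as np

def sample_wishart_offdiag(n, d, rng):
    """Sample W = (X X^T - diag(X X^T)) / sqrt(d) for an n x d matrix X.

    Assumption for illustration: entries are Laplace, a log-concave law.
    Laplace(0, b) has variance 2 b^2, so b = 1/sqrt(2) gives unit variance.
    """
    X = rng.laplace(loc=0.0, scale=1.0 / np.sqrt(2.0), size=(n, d))
    G = X @ X.T
    np.fill_diagonal(G, 0.0)          # remove the diagonal
    return G / np.sqrt(d)

def adjacency(W, tau):
    """Thresholded matrix A_ij = 1{W_ij > tau}, without self-loops."""
    A = (W > tau).astype(int)
    np.fill_diagonal(A, 0)
    return A

rng = np.random.default_rng(0)
W = sample_wishart_offdiag(n=5, d=2000, rng=rng)
A = adjacency(W, tau=0.0)
```

Here rows of the hypothetical matrix `X` play the role of latent feature vectors, and `A` is the adjacency matrix of the resulting random geometric graph.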

1.1 Main result 

Writing $\mathbb{X}_i \in \mathbb{R}^d$ for the $i^{th}$ row of $\mathbb{X}$ one has for $i \neq j$, $W_{i,j} = \frac{1}{\sqrt{d}} \langle \mathbb{X}_i, \mathbb{X}_j \rangle$. In particular
$\mathbb{E} W_{i,j} = 0$ and $\mathbb{E} W_{i,j} W_{\ell,k} = \mathbb{1}\{(i,j) = (\ell,k) \text{ and } i \neq j\}$. Thus for fixed $n$, by the multivariate
central limit theorem one has, as $d$ goes to infinity,

$$\mathcal{W}_{n,d}(\mu) \xrightarrow{D} \mathcal{G}_n,$$

where $\mathcal{G}_n$ is the distribution of an $n \times n$ Wigner matrix with null diagonal and standard Gaussian
entries off diagonal (recall that a Wigner matrix is symmetric and the entries above the main
diagonal are i.i.d.). Recall that the total variation distance between two measures $\lambda, \nu$ is defined as
$\mathrm{TV}(\lambda, \nu) = \sup_A |\lambda(A) - \nu(A)|$ where the supremum is over all measurable sets $A$. Our main
result is the following:

Theorem 1 Assuming that $\mu$ is log-concave^1 and $d / (n^3 \log^2(d)) \to +\infty$, one has

$$\mathrm{TV}(\mathcal{W}_{n,d}(\mu), \mathcal{G}_n) \to 0. \tag{1}$$

Observe that for (1) to be true one needs some kind of smoothness assumption on $\mu$. Indeed if $\mu$
is purely atomic then so is $\mathcal{W}_{n,d}(\mu)$, and thus its total variation distance to $\mathcal{G}_n$ is 1. We also remark
that Theorem 1 is tight up to the logarithmic factor in the sense that if $d / n^3 \to 0$, then

$$\mathrm{TV}(\mathcal{W}_{n,d}(\mu), \mathcal{G}_n) \to 1, \tag{2}$$

see Section 1.2 below for more details on this result. Finally we note that our proof in fact gives
the following quantitative version of (1), where $C > 1$ is a universal constant.




$$\mathrm{TV}(\mathcal{W}_{n,d}(\mu), \mathcal{G}_n)^2 \le C \left( \frac{n^3 \log(n) \log^2(d)}{d} + \cdots \right).$$



1.2 Related work and ideas of proof 

In the case where $\mu$ is a standard Gaussian, Theorem 1 (without the logarithmic factor) was
recently proven simultaneously and independently in Bubeck et al. [2014], Jiang and Li [2013]. We
also observe that previously to these results certain properties of a Gaussian Wishart were already
known to behave as those of a Gaussian matrix, and for values of $d$ much smaller than $n^3$, see
e.g. Johnstone [2001] for the largest eigenvalue at $d \approx n$, and Aubrun et al. [2014] on whether the
quantum state represented by the Wishart is separable at $d \approx n^{3/2}$. The proof of Theorem 1 for the
Gaussian case is simpler as both measures have a known density with a rather simple form, and
one can then explicitly compute the total variation distance as the $L_1$ distance between the
densities. We also note that Bubeck et al. [2014] implicitly proves (2) for this Gaussian case. Taking
inspiration from the latter work, one can show that in the regime $d / n^3 \to 0$, for any $\mu$ (zero mean,
unit variance and finite fourth moment^2), one can distinguish $\mathcal{W}_{n,d}(\mu)$ and $\mathcal{G}_n$ by considering the
statistic $A \in \mathbb{R}^{n \times n} \mapsto \mathrm{Tr}(A^3)$. Indeed it turns out that the mean of $\mathrm{Tr}(A^3)$ under $\mathcal{G}_n$ and $\mathcal{W}_{n,d}(\mu)$
is respectively zero and $\Theta(n^3 / \sqrt{d})$, whereas the variances are respectively $\Theta(n^3)$ and $\Theta(n^3 + n^5 / d^2)$.

^1 A measure $\mu$ with density $f$ is said to be log-concave if $f(\cdot) = \exp(-\varphi(\cdot))$ for some convex function $\varphi$.
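The distinguishing statistic can be illustrated numerically. The following is a rough simulation sketch, assuming Gaussian entries for simplicity; the sample sizes, seeds and function names are illustrative choices. A direct computation gives $\mathbb{E}\,\mathrm{Tr}(W^3) = n(n-1)(n-2)/\sqrt{d}$ under the Wishart measure, which the code stores for comparison with the empirical mean.

```python
import numpy as np

def trace_cubed(A):
    """Return Tr(A^3) for a square matrix A."""
    return float(np.trace(A @ A @ A))

def sample_wishart_offdiag(n, d, rng):
    """Diagonal-removed Wishart matrix with standard Gaussian entries."""
    X = rng.standard_normal((n, d))
    G = X @ X.T
    np.fill_diagonal(G, 0.0)
    return G / np.sqrt(d)

def sample_wigner_offdiag(n, rng):
    """Symmetric matrix, null diagonal, i.i.d. N(0,1) entries above it."""
    U = np.triu(rng.standard_normal((n, n)), k=1)
    return U + U.T

rng = np.random.default_rng(1)
n, d, reps = 30, 30, 200          # d much smaller than n^3
wishart_mean = np.mean([trace_cubed(sample_wishart_offdiag(n, d, rng))
                        for _ in range(reps)])
wigner_mean = np.mean([trace_cubed(sample_wigner_offdiag(n, rng))
                       for _ in range(reps)])
expected_shift = n * (n - 1) * (n - 2) / np.sqrt(d)

# Sanity check on a deterministic input: the triangle graph has
# exactly 6 closed walks of length 3.
triangle = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
triangle_trace = trace_cubed(triangle)
```

In this regime the empirical Wishart mean should sit far above the Wigner one, which is what makes the test based on $\mathrm{Tr}(A^3)$ work.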

Proving normal approximation results without the assumption of independence is a natural
question and has been a subject of intense study over many years. One method that has found
several applications in such settings is the so-called Stein’s method of exchangeable pairs. Since
Stein’s original work (see Stein [1986]) the method has been considerably generalized to prove
error bounds on convergence to the Gaussian distribution in various situations. The multidimensional
case was treated first in Chatterjee and Meckes [2007]. For several applications of Stein’s method
in proving CLTs see Chatterjee [2014] and the references therein. In our setting note that

$$W = \sum_{i=1}^d \left( \mathbb{X}_i \mathbb{X}_i^\top - \mathrm{diag}(\mathbb{X}_i \mathbb{X}_i^\top) \right) / \sqrt{d},$$

where the $\mathbb{X}_i$ are i.i.d. vectors in $\mathbb{R}^n$ whose coordinates are i.i.d. samples from a one-dimensional
measure $\mu$. Considering $Y_i = \mathbb{X}_i \mathbb{X}_i^\top - \mathrm{diag}(\mathbb{X}_i \mathbb{X}_i^\top)$ as a vector in $\mathbb{R}^{n^2}$ and noting that
$|Y_i|^2 \sim n^2$, a straightforward application of Stein’s method using exchangeable pairs (see the
proof of [Chatterjee and Meckes, 2007, Theorem 7]) provides the following suboptimal bound: the
Wishart ensemble converges to the Gaussian ensemble (convergence of integrals against ‘smooth’
enough test functions) when $d \gg n^6$. Whether there is a way to use Stein’s method to recover
Theorem 1 in any reasonable metric (total variation metric, Wasserstein metric, etc.) remains an
open problem (see Section 6 for more on this).

Our approach to proving (1) is information theoretic and hence completely different from
Bubeck et al. [2014], Jiang and Li [2013] (this is a necessity since for a general $\mu$ there is no
simple expression for the density of $\mathcal{W}_{n,d}(\mu)$). The first step in our proof, described in Section 2, is
to use Pinsker’s inequality to change the focus from total variation distance to the relative entropy
(see also Section 2 for definitions). Together with the chain rule for relative entropy this allows us
to bound the relative entropy of $\mathcal{W}_{n,d}(\mu)$ with respect to $\mathcal{G}_n$ by induction on the dimension $n$. The
base case essentially follows from the work of Artstein et al. [2004] who proved that the relative
entropy between the standard one-dimensional Gaussian and $\frac{1}{\sqrt{d}} \sum_{i=1}^d X_i$, where $X_1, \dots, X_d \in \mathbb{R}$ is
an i.i.d. sequence from a log-concave measure $\mu$, goes to 0 at a rate $1/d$. One of the main technical
contributions of our work is a certain generalization of the latter result in higher dimensions,
see Theorem 2 in Section 3. Recently Ball and Nguyen [2012] also studied a high dimensional
generalization of the result in Ball et al. [2003] (which contains the key elements for the proof
in Artstein et al. [2004]) but it seems that Theorem 2 is not comparable to the main theorem in
Ball and Nguyen [2012].

^2 Note that log-concavity implies exponential tails and hence existence of all moments.

Another important part of the induction argument, which is carried out in Section 4, relies
on controlling from above the expectation of $-\log\det(\frac{1}{d} \mathbb{X}\mathbb{X}^\top)$, which should be understood as
the relative entropy between a centered Gaussian with covariance given by $\frac{1}{d} \mathbb{X}\mathbb{X}^\top$ and a standard
Gaussian in $\mathbb{R}^n$. This leads us to study the probability that $\mathbb{X}\mathbb{X}^\top$ is close to being non-invertible.
Denoting by $s_{\min}$ the smallest singular value of $\mathbb{X}$, it suffices to prove a ‘good enough’ upper
bound for $\mathbb{P}(s_{\min}(\mathbb{X}^\top) \le \epsilon)$ for all small $\epsilon$. The case when the entries of $\mathbb{X}$ are Gaussian allows
one to work with exact formulas and was studied in Edelman [1988], Sankar et al. [2006]. The last
few years have seen tremendous progress in understanding the universality of the tail behavior
of extreme singular values of random matrices with i.i.d. entries from general distributions. See
Rudelson and Vershynin [2010] and the references therein for a detailed account of these results.
Such estimates are quite delicate, and it is worthwhile to mention that the following estimate was
proved only recently in Rudelson and Vershynin [2008]: Let $A \in \mathbb{R}^{n \times d}$ (with $d \ge n$) be a rectangular
matrix with i.i.d. subgaussian entries; then for all $\epsilon > 0$,

$$\mathbb{P}\left( s_{\min}(A^\top) \le \epsilon (\sqrt{d} - \sqrt{n-1}) \right) \le (C\epsilon)^{d-n+1} + c^d,$$

where $c, C$ are independent of $n, d$. In full generality, such estimates are essentially sharp since in
the case where the entries are random signs, $s_{\min}$ is zero with probability $c^d$. Unfortunately this
type of bound is not useful for us, as we need to control $\mathbb{P}(s_{\min}(\mathbb{X}^\top) \le \epsilon)$ for arbitrarily small
scales $\epsilon$ (indeed $\log\det(\frac{1}{d} \mathbb{X}\mathbb{X}^\top)$ would blow up if $s_{\min}$ can be zero with non-zero probability). It
turns out that the assumption of log-concavity of the distribution allows us to do that. To this end
we use recent advances in Paouris [2012] on small ball probability estimates for such distributions:
Let $Y \in \mathbb{R}^n$ be an isotropic centered log-concave random variable, and $\epsilon \in (0, 1/10)$, then one has
$\mathbb{P}(|Y| \le \epsilon \sqrt{n}) \le (C\epsilon)^{\sqrt{n}}$. This together with an $\epsilon$-net argument gives us the required control on
$\mathbb{P}(s_{\min}(\mathbb{X}^\top) \le \epsilon)$.

We conclude the paper with several open problems in Section 6. 


2 An induction proof via the chain rule for relative entropy 


Recall that the (differential) entropy of a measure $\lambda$ with a density $f$ (all densities are understood
with respect to the Lebesgue measure unless stated otherwise) is defined as:

$$\mathrm{Ent}(\lambda) = \mathrm{Ent}(f) = -\int f(x) \log f(x) \, dx.$$

The relative entropy of a measure $\lambda$ (with density $f$) with respect to a measure $\nu$ (with density $g$)
is defined as

$$\mathrm{Ent}(\lambda \| \nu) = \int f(x) \log \frac{f(x)}{g(x)} \, dx.$$

With a slight abuse of notation we sometimes write $\mathrm{Ent}(Y \| \nu)$ where $Y$ is a random variable
distributed according to some distribution $\lambda$. Pinsker’s inequality gives:

$$\mathrm{TV}(\mathcal{W}_{n,d}(\mu), \mathcal{G}_n)^2 \le \frac{1}{2} \mathrm{Ent}(\mathcal{W}_{n,d}(\mu) \| \mathcal{G}_n).$$
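As a quick sanity check of Pinsker’s inequality in the form $\mathrm{TV}^2 \le \frac{1}{2} \mathrm{Ent}$, one can look at a pair of one-dimensional Gaussians where both sides have closed forms. This toy example is not from the paper; it only assumes the standard facts that $\mathrm{Ent}(N(m,1) \| N(0,1)) = m^2/2$ and that the total variation distance between $N(m,1)$ and $N(0,1)$ is $2\Phi(|m|/2) - 1$.

```python
import math

def kl_gaussian_shift(m):
    """Relative entropy Ent(N(m,1) || N(0,1)) = m^2 / 2."""
    return 0.5 * m * m

def tv_gaussian_shift(m):
    """TV(N(m,1), N(0,1)) = 2*Phi(|m|/2) - 1 = erf(|m| / (2*sqrt(2)))."""
    return math.erf(abs(m) / (2.0 * math.sqrt(2.0)))

def pinsker_gap(m):
    """Right-hand side minus left-hand side of TV^2 <= Ent / 2."""
    return 0.5 * kl_gaussian_shift(m) - tv_gaussian_shift(m) ** 2

shifts = [0.01, 0.1, 0.5, 1.0, 1.5, 2.0, 5.0]
gaps = [pinsker_gap(m) for m in shifts]
```

All entries of `gaps` should be nonnegative, reflecting that the relative entropy controls the squared total variation distance.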
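Next recall that the chain rule for relative entropy states that for any random variables $Y_1, Y_2, Z_1, Z_2$,

$$\mathrm{Ent}((Y_1, Y_2) \| (Z_1, Z_2)) = \mathrm{Ent}(Y_1 \| Z_1) + \mathbb{E}_{y \sim \lambda_1} \mathrm{Ent}(Y_2 | Y_1 = y \,\|\, Z_2 | Z_1 = y),$$

where $\lambda_1$ is the (marginal) distribution of $Y_1$, and $Y_2 | Y_1 = y$ is used to denote the distribution of
$Y_2$ conditionally on the event $Y_1 = y$ (and similarly for $Z_2 | Z_1 = y$). Also observe that a sample
from $\mathcal{W}_{n+1,d}(\mu)$ can be obtained by adjoining to $(\mathbb{X}\mathbb{X}^\top - \mathrm{diag}(\mathbb{X}\mathbb{X}^\top)) / \sqrt{d}$ (whose distribution is
$\mathcal{W}_{n,d}(\mu)$) the column vector $\mathbb{X} X / \sqrt{d}$ (and the row vector $(\mathbb{X} X)^\top / \sqrt{d}$) where $X \in \mathbb{R}^d$ has i.i.d.
entries from $\mu$ and is independent of the matrix $\mathbb{X}$. Thus denoting $\gamma_n$ for the standard Gaussian
measure in $\mathbb{R}^n$, we obtain

$$\mathrm{Ent}(\mathcal{W}_{n+1,d}(\mu) \| \mathcal{G}_{n+1}) = \mathrm{Ent}(\mathcal{W}_{n,d}(\mu) \| \mathcal{G}_n) + \mathbb{E}_{\mathbb{X}} \mathrm{Ent}\left( \mathbb{X} X / \sqrt{d} \,|\, \mathbb{X}\mathbb{X}^\top \,\|\, \gamma_n \right). \tag{3}$$

By convexity of the relative entropy (see e.g., Cover and Thomas [1991]) one also has:

$$\mathbb{E}_{\mathbb{X}} \mathrm{Ent}\left( \mathbb{X} X / \sqrt{d} \,|\, \mathbb{X}\mathbb{X}^\top \,\|\, \gamma_n \right) \le \mathbb{E}_{\mathbb{X}} \mathrm{Ent}\left( \mathbb{X} X / \sqrt{d} \,|\, \mathbb{X} \,\|\, \gamma_n \right). \tag{4}$$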
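The recursive structure used in the chain rule argument can be checked directly: adjoining one new row to the data matrix only adds one row and one column to the diagonal-removed Wishart matrix. The sketch below verifies this on random inputs; the function names and the Gaussian test data are illustrative assumptions.

```python
import numpy as np

def offdiag_wishart(X):
    """Return (X X^T - diag(X X^T)) / sqrt(d) for an n x d matrix X."""
    d = X.shape[1]
    G = X @ X.T
    np.fill_diagonal(G, 0.0)
    return G / np.sqrt(d)

def adjoin_row(X, x):
    """Build the (n+1) x (n+1) matrix from the n x n one and the vector X x / sqrt(d)."""
    n, d = X.shape
    W_new = np.zeros((n + 1, n + 1))
    W_new[:n, :n] = offdiag_wishart(X)   # top-left block is unchanged
    v = X @ x / np.sqrt(d)               # new column (and row, by symmetry)
    W_new[:n, n] = v
    W_new[n, :n] = v
    return W_new

rng = np.random.default_rng(2)
X = rng.standard_normal((4, 50))
x = rng.standard_normal(50)
W_recursive = adjoin_row(X, x)
W_direct = offdiag_wishart(np.vstack([X, x]))
```

Both constructions give the same matrix, which is the observation that lets the relative entropy be bounded one row at a time.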


Next we need a simple lemma to rewrite the above term: 

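Lemma 1 Let $A \in \mathbb{R}^{n \times d}$ and $Q \in \mathbb{R}^{n \times n}$ be such that $Q A A^\top Q^\top = \mathrm{I}_n$. Then one has for any
isotropic random variable $X \in \mathbb{R}^d$,

$$\mathrm{Ent}(AX \| \gamma_n) = \mathrm{Ent}(QAX \| \gamma_n) + \frac{1}{2} \mathrm{Tr}(AA^\top) - \frac{n}{2} + \frac{1}{2} \log\det(QQ^\top).$$

Proof Denote $\phi_\Sigma$ for the density of a centered Gaussian with covariance matrix $\Sigma$, and let $G \sim \gamma_n$.
Also let $f$ be the density of $QAX$. Then one has:

$$\mathrm{Ent}(AX \| G) = \mathrm{Ent}(QAX \| QG)$$
$$= \int f(x) \log \frac{f(x)}{\phi_{QQ^\top}(x)} \, dx$$
$$= \int f(x) \log \frac{f(x)}{\phi_{\mathrm{I}_n}(x)} \, dx + \int f(x) \log \frac{\phi_{\mathrm{I}_n}(x)}{\phi_{QQ^\top}(x)} \, dx$$
$$= \mathrm{Ent}(QAX \| G) + \int f(x) \left( \frac{1}{2} x^\top (QQ^\top)^{-1} x - \frac{1}{2} x^\top x + \frac{1}{2} \log\det(QQ^\top) \right) dx$$
$$= \mathrm{Ent}(QAX \| G) + \frac{1}{2} \mathrm{Tr}\left( (QQ^\top)^{-1} \right) - \frac{n}{2} + \frac{1}{2} \log\det(QQ^\top),$$

where for the last equality we used the fact that $QAX$ is isotropic, that is $\int f(x) x x^\top dx = \mathrm{I}_n$.
Finally it only remains to observe that $\mathrm{Tr}\left( (QQ^\top)^{-1} \right) = \mathrm{Tr}(AA^\top)$. ■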
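The identity of Lemma 1 can be sanity-checked in the special case where $X$ is standard Gaussian, since then $AX \sim N(0, AA^\top)$, $QAX \sim N(0, \mathrm{I}_n)$ and both entropies have closed forms. The sketch below assumes the standard formula $\mathrm{Ent}(N(0,\Sigma) \| \gamma_n) = \frac{1}{2}(\mathrm{Tr}\,\Sigma - n - \log\det\Sigma)$ and writes the correction term as $\frac{1}{2}\mathrm{Tr}(AA^\top) - \frac{n}{2} + \frac{1}{2}\log\det(QQ^\top)$ with $Q = (AA^\top)^{-1/2}$; the helper names are hypothetical.

```python
import numpy as np

def kl_centered_gaussian_to_standard(Sigma):
    """Ent(N(0, Sigma) || N(0, I)) = (Tr Sigma - n - log det Sigma) / 2."""
    n = Sigma.shape[0]
    _, logdet = np.linalg.slogdet(Sigma)
    return 0.5 * (np.trace(Sigma) - n - logdet)

def lemma1_rhs_gaussian(A):
    """Right-hand side of the lemma when X is standard Gaussian.

    In that case Q A X is standard Gaussian, so Ent(QAX || gamma_n) = 0.
    """
    n = A.shape[0]
    Sigma = A @ A.T
    evals, evecs = np.linalg.eigh(Sigma)
    Q = evecs @ np.diag(evals ** -0.5) @ evecs.T   # symmetric inverse square root
    assert np.allclose(Q @ Sigma @ Q.T, np.eye(n))
    _, logdet_QQt = np.linalg.slogdet(Q @ Q.T)
    return 0.0 + 0.5 * np.trace(Sigma) - 0.5 * n + 0.5 * logdet_QQt

rng = np.random.default_rng(4)
A = rng.standard_normal((3, 7))
lhs = kl_centered_gaussian_to_standard(A @ A.T)
rhs = lemma1_rhs_gaussian(A)
```

The two numbers `lhs` and `rhs` agree up to floating point error, which is consistent with the decomposition in the proof.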

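Combining (3) and (4) with Lemma 1 (noting that one can take $A = \mathbb{X} / \sqrt{d}$ and $Q = (\frac{1}{d} \mathbb{X}\mathbb{X}^\top)^{-1/2}$), and using that
$\mathbb{E} \, \mathrm{Tr}(\frac{1}{d} \mathbb{X}\mathbb{X}^\top) = n$, one obtains

$$\mathrm{Ent}(\mathcal{W}_{n+1,d}(\mu) \| \mathcal{G}_{n+1}) \le \mathrm{Ent}(\mathcal{W}_{n,d}(\mu) \| \mathcal{G}_n) + \mathbb{E}_{\mathbb{X}} \mathrm{Ent}\left( (\mathbb{X}\mathbb{X}^\top)^{-1/2} \mathbb{X} X \,|\, \mathbb{X} \,\|\, \gamma_n \right) - \frac{1}{2} \mathbb{E}_{\mathbb{X}} \log\det\left( \tfrac{1}{d} \mathbb{X}\mathbb{X}^\top \right). \tag{5}$$

In Section 3 we show how to bound the term $\mathrm{Ent}(AX \| \gamma_n)$ where $A \in \mathbb{R}^{n \times d}$ has orthonormal rows
(i.e., $AA^\top = \mathrm{I}_n$), and then in Section 4 we deal with the term $\mathbb{E}_{\mathbb{X}} \log\det(\frac{1}{d} \mathbb{X}\mathbb{X}^\top)$.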







3 A high dimensional entropic CLT 

The main goal of this section is to prove the following high dimensional generalization of the 
entropic CLT of Artstein et al. [2004], 

Theorem 2 Let $Y \in \mathbb{R}^d$ be a random vector with i.i.d. entries from a distribution $\nu$ with zero
mean, unit variance, and spectral gap^3 $c \in (0, 1]$. Let $A \in \mathbb{R}^{n \times d}$ be a matrix such that $AA^\top = \mathrm{I}_n$.
Let $\epsilon = \max_{i \in [d]} (A^\top A)_{i,i}$ and $\zeta = \max_{i,j \in [d], i \neq j} |(A^\top A)_{i,j}|$. Then one has

$$\mathrm{Ent}(AY \| \gamma_n) \le n \min\left( 2(\epsilon + \zeta^2 d)/c, \, 1 \right) \mathrm{Ent}(\nu \| \gamma_1).$$

Note that the assumption $AA^\top = \mathrm{I}_n$ implies that the rows of $A$ form an orthonormal system. In
particular if $A$ is built by picking rows one after the other uniformly at random on the Euclidean sphere in
$\mathbb{R}^d$, conditionally on being orthogonal to previous rows, then one expects that $\epsilon \simeq n/d$ and $\zeta \simeq \sqrt{n}/d$.
Theorem 2 then yields $\mathrm{Ent}(AY \| \gamma_n) \lesssim n^2/d$. Thus we already see appearing the term $n^3/d$ from
Theorem 1 as we will sum the latter bound over the $n$ rounds of induction (see Section 2).

We also note that for the special case $n = 1$, Theorem 2 is slightly weaker than the result of
Artstein et al. [2004], which involves the $\ell_4$-norm of $A$.

Section 3.1 and Section 3.2 are dedicated to the proof of Theorem 2. Then in Section 3.3 we
show how to apply this result to bound the term $\mathbb{E}_{\mathbb{X}} \mathrm{Ent}(Q \mathbb{X} X / \sqrt{d} \,|\, \mathbb{X} \,\|\, \gamma_n)$ from Section 2.
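To get a feeling for the quantities $\epsilon$ and $\zeta$ in Theorem 2, the following sketch builds a random matrix with orthonormal rows and evaluates the prefactor $n \min(2(\epsilon + \zeta^2 d)/c, 1)$. Generating $A$ through a QR factorization of a Gaussian matrix, the value of `c`, and all function names are assumptions made for illustration.

```python
import numpy as np

def random_orthonormal_rows(n, d, rng):
    """Return an n x d matrix A with A A^T = I_n (requires d >= n)."""
    Qmat, _ = np.linalg.qr(rng.standard_normal((d, n)))  # d x n, orthonormal columns
    return Qmat.T

def eps_zeta(A):
    """epsilon = max diagonal entry of A^T A, zeta = max |off-diagonal entry|."""
    B = A.T @ A
    eps = float(np.max(np.diag(B)))
    off = np.abs(B - np.diag(np.diag(B)))
    zeta = float(np.max(off))
    return eps, zeta

def theorem2_factor(A, c):
    """Prefactor n * min(2 (eps + zeta^2 d) / c, 1) from the entropy bound."""
    n, d = A.shape
    eps, zeta = eps_zeta(A)
    return n * min(2.0 * (eps + zeta ** 2 * d) / c, 1.0)

rng = np.random.default_rng(5)
n, d = 4, 400
A = random_orthonormal_rows(n, d, rng)
eps, zeta = eps_zeta(A)
factor = theorem2_factor(A, c=1.0)
```

Since $A^\top A$ is an orthogonal projection of rank $n$, its diagonal entries lie in $[0,1]$ and average to $n/d$, so `eps` is always at least $n/d$.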

3.1 From entropy to Fisher information 

For a density function $w : \mathbb{R}^n \to \mathbb{R}_+$, we denote $J(w) = \int \frac{|\nabla w(x)|^2}{w(x)} \, dx$ for its Fisher information,
and $I(w) = \int \frac{\nabla w(x) \nabla w(x)^\top}{w(x)} \, dx$ for the Fisher information matrix (if $\nu$ denotes the measure whose
density is $w$, we may also write $J(\nu)$ instead of $J(w)$). Also denote $P_t$ for the Ornstein-Uhlenbeck
semigroup, that is with $G \sim \gamma_n$ one has for a random variable $Z$ with density $g$,

$$P_t Z = \exp(-t) Z + \sqrt{1 - \exp(-2t)} \, G,$$

and $P_t g$ is the density of $P_t Z$. The de Bruijn identity states that the Fisher information is the time
derivative of the entropy along the Ornstein-Uhlenbeck semigroup, more precisely one has:

$$\mathrm{Ent}(w \| \gamma_n) = \mathrm{Ent}(\gamma_n) - \mathrm{Ent}(w) = \int_0^\infty \left( J(P_t w) - n \right) dt.$$

Our objective is to prove a bound of the form (for some constant $C$ depending on $A$)

$$\mathrm{Ent}(AY \| \gamma_n) \le C \, \mathrm{Ent}(\nu \| \gamma_1), \tag{6}$$

and thus given the above identity it suffices to show that for any $t > 0$,

$$J(h_t) - n \le C \left( J(\nu_t) - 1 \right), \tag{7}$$

where $h_t$ is the density of $P_t A Y$ (which is equal to the density of $A P_t Y$) and $\nu_t$ is such that $P_t Y$
has distribution $\nu_t^{\otimes d}$. Furthermore if $e_1, \dots, e_n$ denotes the canonical basis of $\mathbb{R}^n$, then to prove (7)
it is enough to show that for any $i \in [n]$,

$$e_i^\top I(h_t) e_i - 1 \le C_i \left( J(\nu_t) - 1 \right), \tag{8}$$

where $\sum_{i=1}^n C_i = C$. We will show that one can take

$$C_i = 1 - \frac{c U_i^2}{c W_i + 2 V_i},$$

where we denote $B = A^\top A \in \mathbb{R}^{d \times d}$ and

$$U_i = \sum_{j=1}^d A_{i,j}^2 (1 - B_{j,j}), \quad W_i = \sum_{j=1}^d A_{i,j}^2 (1 - B_{j,j})^2, \quad V_i = \sum_{j,k \in [d], k \neq j} (A_{i,j} B_{j,k})^2.$$

Straightforward calculations (using that $U_i \ge 1 - \epsilon$, $W_i \le 1$, and $V_i \le \zeta^2 d$) show that one has
$\sum_{i=1}^n \left( 1 - \frac{c U_i^2}{c W_i + 2 V_i} \right) \le 2n(\epsilon + \zeta^2 d)/c$ where $\epsilon = \max_{i \in [d]} B_{i,i}$ and $\zeta = \max_{i,j \in [d], i \neq j} |B_{i,j}|$, thus
concluding the proof of Theorem 2.

In the next subsection we prove (8) for a given $t > 0$ and $i = 1$. We use the following well
known but crucial fact: the spectral gap of $\nu_t$ is in $[c, 1]$ (see [Proposition 1, Ball et al. [2003]]).
Denoting $f$ for the density of $\nu_t$, one has with $\varphi = -\log f$ that $J := J(\nu_t) = \int \varphi''(x) \, d\nu_t(x)$.
The last equality easily follows from the fact that for any $t > 0$ one has $\int f'' = 0$ (which itself
follows from the smoothness of $\nu_t$ induced by the convolution of $\nu$ with a Gaussian).

^3 A probability measure $\mu$ is said to have spectral gap $c$ if for all smooth functions $g$ with $\mathbb{E}_\mu(g) = 0$, we have
$\mathbb{E}_\mu(g^2) \le \frac{1}{c} \mathbb{E}_\mu(g'^2)$.
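The ‘straightforward calculations’ bounding the sum of the constants $C_i$ can be checked numerically. The sketch below computes $U_i$, $W_i$, $V_i$ and $C_i = 1 - cU_i^2/(cW_i + 2V_i)$ for a random matrix with orthonormal rows and compares $\sum_i C_i$ with $2n(\epsilon + \zeta^2 d)/c$. The random construction of $A$, the value of `c`, and the function names are illustrative assumptions.

```python
import numpy as np

def random_orthonormal_rows(n, d, rng):
    """Return an n x d matrix A with A A^T = I_n (requires d >= n)."""
    Qmat, _ = np.linalg.qr(rng.standard_normal((d, n)))
    return Qmat.T

def row_constants(A, c):
    """Return the vectors (U_i), (W_i), (V_i), (C_i) indexed by the rows of A."""
    B = A.T @ A
    diag_B = np.diag(B)
    B_off_sq = B ** 2
    np.fill_diagonal(B_off_sq, 0.0)
    A_sq = A ** 2
    U = A_sq @ (1.0 - diag_B)               # sum_j A_ij^2 (1 - B_jj)
    W = A_sq @ (1.0 - diag_B) ** 2          # sum_j A_ij^2 (1 - B_jj)^2
    V = A_sq @ B_off_sq.sum(axis=1)         # sum_j A_ij^2 sum_{k != j} B_jk^2
    C = 1.0 - c * U ** 2 / (c * W + 2.0 * V)
    return U, W, V, C

rng = np.random.default_rng(6)
n, d, c = 3, 60, 1.0 / 12.0
A = random_orthonormal_rows(n, d, rng)
U, W, V, C = row_constants(A, c)

B = A.T @ A
eps = np.max(np.diag(B))
zeta = np.max(np.abs(B - np.diag(np.diag(B))))
upper = 2.0 * n * (eps + zeta ** 2 * d) / c
```

Each $C_i$ lies in $[0, 1]$ (because $W_i \ge U_i^2$ by Cauchy-Schwarz), and their sum stays below `upper`.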


3.2 Variational representation of Fisher information 

Let $Z \in \mathbb{R}^d$ be a random variable with a twice continuously differentiable density $w$ such that
$\int \frac{|\nabla w|^2}{w} < \infty$ and $\int \|\nabla^2 w\| < \infty$, and let $h$ be the density of $AZ \in \mathbb{R}^n$. Our main tool is a remarkable
formula from Ball et al. [2003], which states the following: for all $e \in \mathbb{R}^n$ and all sufficiently
smooth maps $p : \mathbb{R}^d \to \mathbb{R}^d$ with $Ap(x) = e$, $\forall x \in \mathbb{R}^d$, one has (with $Dp$ denoting the Jacobian
matrix of $p$),

$$e^\top I(h) e \le \int \left( \mathrm{Tr}(Dp(x)^2) + p(x)^\top \nabla^2(-\log w(x)) \, p(x) \right) w(x) \, dx. \tag{9}$$

For the sake of completeness we include a short proof of this inequality in Section 5.

Let $(a_1, \dots, a_d)$ be the first row of $A$. Following Artstein et al. [2004], to prove (7), we would
like to use the above formula^4 with $p$ of the form $(a_1 r(x_1), \dots, a_d r(x_d))$ for some map $r : \mathbb{R} \to \mathbb{R}$.
Since we need to satisfy $Ap(x) = e_1$ we adjust the formula accordingly and take

$$p(x) = (\mathrm{I}_d - A^\top A)(a_1 r(x_1), \dots, a_d r(x_d))^\top + A^\top e_1.$$

^4 Note that the smoothness assumptions on $w$ are satisfied in our context since we consider a random variable
convolved with a Gaussian.
In particular we get, with $B = A^\top A$,

$$p_i(x) = a_i + a_i (1 - B_{i,i}) r(x_i) - \sum_{j \neq i} B_{i,j} a_j r(x_j),$$

and

$$\frac{\partial p_i}{\partial x_j}(x) = \begin{cases} a_i (1 - B_{i,i}) r'(x_i) & \text{if } i = j, \\ -B_{i,j} a_j r'(x_j) & \text{otherwise.} \end{cases}$$
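The construction of the test map $p$ can be verified numerically: the matrix form $(\mathrm{I}_d - A^\top A)(a_1 r(x_1), \dots, a_d r(x_d))^\top + A^\top e_1$ should satisfy $Ap(x) = e_1$ and agree with the coordinate formula for $p_i$. The sketch below does this for a random $A$ with orthonormal rows; the choice $r = \tanh$ and the function names are arbitrary assumptions used only for illustration.

```python
import numpy as np

def random_orthonormal_rows(n, d, rng):
    """Return an n x d matrix A with A A^T = I_n (requires d >= n)."""
    Qmat, _ = np.linalg.qr(rng.standard_normal((d, n)))
    return Qmat.T

def p_matrix_form(A, r, x):
    """p(x) = (I_d - A^T A)(a_1 r(x_1), ..., a_d r(x_d))^T + A^T e_1."""
    n, d = A.shape
    a = A[0]                       # first row of A
    e1 = np.zeros(n)
    e1[0] = 1.0
    return (np.eye(d) - A.T @ A) @ (a * r(x)) + A.T @ e1

def p_coordinate_form(A, r, x):
    """Coordinates p_i = a_i + a_i (1 - B_ii) r(x_i) - sum_{j != i} B_ij a_j r(x_j)."""
    d = A.shape[1]
    a = A[0]
    B = A.T @ A
    rx = r(x)
    out = np.empty(d)
    for i in range(d):
        off = sum(B[i, j] * a[j] * rx[j] for j in range(d) if j != i)
        out[i] = a[i] + a[i] * (1.0 - B[i, i]) * rx[i] - off
    return out

rng = np.random.default_rng(3)
A = random_orthonormal_rows(3, 8, rng)
x = rng.standard_normal(8)
p_val = p_matrix_form(A, np.tanh, x)
p_coord = p_coordinate_form(A, np.tanh, x)
```

The constraint holds for any choice of `r` because $A(\mathrm{I}_d - A^\top A) = 0$ and $AA^\top e_1 = e_1$ when $AA^\top = \mathrm{I}_n$.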


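Next recall that we apply (9) to prove (8) where $w(x) = \prod_{i=1}^d f(x_i)$, in which case we have (recall
also the notation $\varphi = -\log f$):

$$p(x)^\top \nabla^2(-\log w(x)) \, p(x) = \sum_{i=1}^d p_i(x)^2 \varphi''(x_i)$$
$$= \sum_{i=1}^d \varphi''(x_i) \left( a_i + a_i (1 - B_{i,i}) r(x_i) - \sum_{j \neq i} B_{i,j} a_j r(x_j) \right)^2.$$

We also have

$$\mathrm{Tr}(Dp(x)^2) = \sum_{i=1}^d a_i^2 (1 - B_{i,i})^2 r'(x_i)^2 + \sum_{i \neq j} B_{i,j}^2 a_i a_j r'(x_i) r'(x_j).$$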


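Putting the above together we obtain (with a slightly lengthy straightforward computation) that
$e_1^\top I(h_t) e_1$ is upper bounded by (recall also that $\sum_{i} a_i^2 = 1$ and $\sum_{j} B_{i,j} a_j = a_i$ since $B A^\top = A^\top$)

$$J + W \left( \int f (r')^2 + \int f \varphi'' r^2 \right) + J V \int f r^2 + J (W - V) \left( \int f r \right)^2 + 2U \left( \int f \varphi'' r - J \int f r \right) - 2W \left( \int f r \right) \left( \int f \varphi'' r \right) + M \left( \int f r' \right)^2, \tag{10}$$

where

$$U = \sum_{i=1}^d a_i^2 (1 - B_{i,i}), \quad W = \sum_{i=1}^d a_i^2 (1 - B_{i,i})^2, \quad V = \sum_{i \neq j} (B_{i,j} a_j)^2, \quad M = \sum_{i \neq j} B_{i,j}^2 a_i a_j.$$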
2=1 


2 = 1 


Observe that by the Cauchy-Schwarz inequality one has $M \le V$, and furthermore following Artstein et al.
[2004] one also has with $m = \int f r$,

$$\left( \int f r' \right)^2 = \left( \int f' (r - m) \right)^2 \le J \int f (r - m)^2 = J \left( \int f r^2 - m^2 \right).$$

Thus we get from (10) and the above observations that $e_1^\top I(h_t) e_1 - J \le T(r)$ where

$$T(r) = W \left( \int f (r')^2 + \int f \varphi'' r^2 \right) + 2 J V \left( \int f r^2 \right) + J (W - 2V) \left( \int f r \right)^2 + 2U \left( \int f \varphi'' r - J \int f r \right) - 2W \left( \int f r \right) \left( \int f \varphi'' r \right),$$

which is the exact same quantity as the one obtained in Artstein et al. [2004]. The goal now is to
optimize over $r$ to make this quantity as negative as possible. Solving the above optimization
problem is exactly the content of [Artstein et al., 2004, Section 2.4] and it yields the following
bound:

$$e_1^\top I(h_t) e_1 - 1 \le \left( 1 - \frac{c U^2}{c W + 2V} \right) (J - 1),$$

which is exactly the claimed bound in (8).


3.3 Using Theorem 2 


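Given (5) we want to apply Theorem 2 with $A = (\mathbb{X}\mathbb{X}^\top)^{-1/2} \mathbb{X}$ (also observe that the spectral
gap assumption of Theorem 2 is satisfied since log-concavity and isotropy of $\mu$ imply that $\mu$ has a
spectral gap in $[1/12, 1]$, Bobkov [1999]). In particular we have $A^\top A = \mathbb{X}^\top (\mathbb{X}\mathbb{X}^\top)^{-1} \mathbb{X}$, and thus
denoting $\mathbb{X}_i \in \mathbb{R}^n$ for the $i^{th}$ column of $\mathbb{X}$ one has for any $i, j \in [d]$,

$$(A^\top A)_{i,j} = \mathbb{X}_i^\top (\mathbb{X}\mathbb{X}^\top)^{-1} \mathbb{X}_j.$$

In particular this yields:

$$|(A^\top A)_{i,j}| \le \frac{1}{d} |\mathbb{X}_i^\top \mathbb{X}_j| + \frac{1}{d} \cdot |\mathbb{X}_i| \cdot |\mathbb{X}_j| \cdot \left\| \left( \tfrac{1}{d} \mathbb{X}\mathbb{X}^\top \right)^{-1} - \mathrm{I}_n \right\|.$$

We now recall two important results on log-concave random vectors^5. First Paouris’ inequality
Pao states that for an isotropic, centered, log-concave random variable $Y \in \mathbb{R}^n$ one has for any
$t \ge C$,

$$\mathbb{P}\left( |Y| \ge (1 + t) \sqrt{n} \right) \le \exp(-c t \sqrt{n}),$$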

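where $c, C$ are universal constants. We also need an inequality proved by Adamczak, Litvak, Pajor
and Tomczak-Jaegermann (Adamczak et al. [2010]) which states that for a sequence $Y_1, \dots, Y_d \in \mathbb{R}^n$
of i.i.d. copies of $Y$, one has for any $t \ge 1$ and $\epsilon \in (0, 1)$,

$$\mathbb{P}\left( \left\| \frac{1}{d} \sum_{i=1}^d Y_i Y_i^\top - \mathrm{I}_n \right\| > \epsilon \right) \le \exp(-c t \sqrt{n}),$$

provided that $d \ge C \frac{t^4}{\epsilon^2} \log^2\left( \frac{2 t^2}{\epsilon^2} \right) n$.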

Paouris’ inequality directly yields that for any $i \in [d]$, with probability at least $1 - \delta$, one has

$$|\mathbb{X}_i| \le \sqrt{n} + \frac{1}{c} \log(1/\delta).$$

Furthermore, by Prékopa-Leindler, conditionally on $\mathbb{X}_j$ one has for $i \neq j$ that $\mathbb{X}_i^\top \mathbb{X}_j / |\mathbb{X}_j|$ is a
centered, isotropic, log-concave random variable. In particular using again Paouris’ inequality and
independence of $\mathbb{X}_i$ and $\mathbb{X}_j$ one obtains that for $i \neq j$, with probability at least $1 - \delta$,

$$|\mathbb{X}_i^\top \mathbb{X}_j| \le |\mathbb{X}_j| \left( 1 + \frac{1}{c} \log(1/\delta) \right).$$

^5 We note that more classical inequalities could also be used here since the entries of $\mathbb{X}$ are independent. This would
slightly improve the logarithmic factors but it would obscure the main message of this section so we decided to use
the more general inequalities for log-concave vectors.

Finally the inequality from Adamczak, Litvak, Pajor and Tomczak-Jaegermann yields that with
probability at least $1 - \delta$,

(11)

Also note that if $\|A - \mathrm{I}_n\| \le \epsilon < 1$ then $\|A^{-1} - \mathrm{I}_n\| \le \frac{\epsilon}{1 - \epsilon}$. From now on $C$ denotes a universal
constant whose value can change at each occurrence. Putting together all of the above with a union
bound, we obtain for $d \ge C n^2$ that with probability at least $1 - 1/d$, for all $i \neq j$,

$$|\mathbb{X}_i| \le C (\sqrt{n} + \log(d)),$$

$$|\mathbb{X}_i^\top \mathbb{X}_j| \le C (\sqrt{n} \log(d) + \log^2(d)),$$


-XX^) 
d ' 


T\-l 




d 


n 


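This yields that for $i \neq j$,


and

$$|(A^\top A)_{i,i}| \le C \, \frac{n + \log^2(d)}{d}.$$

Thus denoting $\epsilon = \max_{i \in [d]} (A^\top A)_{i,i}$ and $\zeta = \max_{i,j \in [d], i \neq j} |(A^\top A)_{i,j}|$ one has: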

d 


4 Small ball probability estimates 


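The goal of this section is to upper bound $\mathbb{E}\left( -\log\det(\frac{1}{d} \mathbb{X}\mathbb{X}^\top) \right)$.

Lemma 2

$$\mathbb{E}\left( -\log\det\left( \tfrac{1}{d} \mathbb{X}\mathbb{X}^\top \right) \right) \le C \left( \sqrt{\frac{n}{d}} + \frac{n^2}{d} \right). \tag{12}$$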

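Proof We decompose this expectation on the event (and its complement) that the smallest
eigenvalue $\lambda_{\min}$ of $\frac{1}{d} \mathbb{X}\mathbb{X}^\top$ is less than $1/2$. We first write, using $-\log(x) \le 1 - x + (1 - x)^2$ for
$x \ge 1/2$,

$$\mathbb{E}\left( -\log\det\left( \tfrac{1}{d} \mathbb{X}\mathbb{X}^\top \right) \mathbb{1}\{\lambda_{\min} \ge 1/2\} \right) \le \mathbb{E}\left( \left| \mathrm{Tr}\left( \mathrm{I}_n - \tfrac{1}{d} \mathbb{X}\mathbb{X}^\top \right) \right| + \left\| \mathrm{I}_n - \tfrac{1}{d} \mathbb{X}\mathbb{X}^\top \right\|_{HS}^2 \right).$$

Denote $\zeta$ for the $4^{th}$ moment of $\mu$. Then one has (recall that $\mathbb{X}_i \in \mathbb{R}^d$ denotes the $i^{th}$ row of $\mathbb{X}$)

$$\mathbb{E} \left| \mathrm{Tr}\left( \mathrm{I}_n - \tfrac{1}{d} \mathbb{X}\mathbb{X}^\top \right) \right| \le \sqrt{ \mathbb{E} \left( \sum_{i=1}^n \left( 1 - \frac{|\mathbb{X}_i|^2}{d} \right) \right)^2 } = \sqrt{ (\zeta - 1) \frac{n}{d} }.$$

Similarly one can easily check that

$$\mathbb{E} \left\| \mathrm{I}_n - \tfrac{1}{d} \mathbb{X}\mathbb{X}^\top \right\|_{HS}^2 = \frac{n^2 - n}{d} + \frac{n}{d} (\zeta - 1) \le \zeta \, \frac{n^2}{d}.$$

Next note that by log-concavity of $\mu$ one has $\zeta \le 70$, and thus we proved (for some universal
constant $C > 0$):

$$\mathbb{E}\left( -\log\det\left( \tfrac{1}{d} \mathbb{X}\mathbb{X}^\top \right) \mathbb{1}\{\lambda_{\min} \ge 1/2\} \right) \le C \left( \sqrt{\frac{n}{d}} + \frac{n^2}{d} \right). \tag{13}$$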

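We now take care of the integral on the event $\{\lambda_{\min} < 1/2\}$. First observe that the inequality (11)
from Adamczak, Litvak, Pajor and Tomczak-Jaegermann gives for $d \ge C n^2$,

$$\mathbb{P}(\lambda_{\min} < 1/2) \le \exp(-d^{1/10}).$$

In particular we have for any $\xi \in (0, 1)$:

$$\mathbb{E}\left( -\log\det\left( \tfrac{1}{d} \mathbb{X}\mathbb{X}^\top \right) \mathbb{1}\{\lambda_{\min} < 1/2\} \right) \le n \, \mathbb{E}\left( -\log(\lambda_{\min}) \mathbb{1}\{\lambda_{\min} < 1/2\} \right)$$
$$= n \int_{\log(2)}^{\infty} \mathbb{P}(-\log(\lambda_{\min}) \ge t) \, dt$$
$$= n \int_0^{1/2} \frac{1}{s} \mathbb{P}(\lambda_{\min} < s) \, ds$$
$$\le \frac{n}{\xi} \exp(-d^{1/10}) + n \int_0^{\xi} \frac{1}{s} \mathbb{P}(\lambda_{\min} < s) \, ds. \tag{14}$$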


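Thus it remains to control $\mathbb{P}(\lambda_{\min} < s)$ for $s$ small enough. This is essentially a small ball problem.
We follow a standard route, by using an $\epsilon$-net argument together with a basic small ball probability
estimate. First observe that using the subexponential tail of isotropic log-concave random
variables, together with the $\epsilon$-net argument, one easily obtains for any $M \ge C$, $\mathbb{P}(\lambda_{\max} > M) \le
\exp(-cM)$ where $\lambda_{\max}$ is the largest eigenvalue of $\frac{1}{d} \mathbb{X}\mathbb{X}^\top$. Recall that

$$\mathbb{P}(\lambda_{\min} < s) = \mathbb{P}\left( \exists \theta \in \mathbb{S}^{n-1} : \theta^\top \tfrac{1}{d} \mathbb{X}\mathbb{X}^\top \theta < s \right) = \mathbb{P}\left( \exists \theta \in \mathbb{S}^{n-1} : |\mathbb{X}^\top \theta| < \sqrt{sd} \right).$$

Furthermore if $|\frac{1}{\sqrt{d}} \mathbb{X}^\top \theta| < \sqrt{s}$ for some $\theta \in \mathbb{S}^{n-1}$, then one has for any $\varphi \in \mathbb{S}^{n-1}$, $|\frac{1}{\sqrt{d}} \mathbb{X}^\top \varphi| <
\sqrt{s} + \sqrt{\lambda_{\max}} \, |\theta - \varphi|$. Thus we get with the $\epsilon$-net argument and the above display:

$$\mathbb{P}(\lambda_{\min} < s) \le \mathbb{P}\left( |\mathbb{X}^\top \theta| < 2\sqrt{sd} \right) + \mathbb{P}\left( \lambda_{\max} > 1/\sqrt{s} \right).$$

We now use the Paouris small ball probability bound Paouris [2012] which states that for an
isotropic centered log-concave random variable $Y \in \mathbb{R}^d$, and any $\epsilon \in (0, 1/10)$, one has

$$\mathbb{P}(|Y| \le \epsilon \sqrt{d}) \le (c\epsilon)^{\sqrt{d}}.$$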
Thus we obtain for $d \ge C n^2$

$$\mathbb{P}(\lambda_{\min} < s) \le \cdots + \exp(-C/\sqrt{s}).$$

Finally plugging this back in (14) we obtain for $d \ge C n^2$,

$$\mathbb{E}\left( -\log\det\left( \tfrac{1}{d} \mathbb{X}\mathbb{X}^\top \right) \mathbb{1}\{\lambda_{\min} < 1/2\} \right) \le n \exp(-d^{1/10}),$$

and thus together with (13) it yields (12). ■
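The first step of the proof relies on the scalar inequality $-\log(x) \le 1 - x + (1-x)^2$ for $x \ge 1/2$, applied to each eigenvalue of $\frac{1}{d}\mathbb{X}\mathbb{X}^\top$. The sketch below checks the resulting matrix inequality $-\log\det(M) \le \mathrm{Tr}(\mathrm{I}_n - M) + \|\mathrm{I}_n - M\|_{HS}^2$ on one random sample; the Gaussian entries, the dimensions and the function names are illustrative assumptions.

```python
import numpy as np

def neg_logdet(M):
    """Return -log det(M) for a positive definite matrix M."""
    _, logdet = np.linalg.slogdet(M)
    return -logdet

def second_order_bound(M):
    """Return Tr(I - M) + ||I - M||_HS^2."""
    n = M.shape[0]
    D = np.eye(n) - M
    return float(np.trace(D) + np.sum(D ** 2))

# Scalar inequality on a grid of points x >= 1/2.
xs = np.linspace(0.5, 5.0, 1000)
scalar_gap = (1.0 - xs + (1.0 - xs) ** 2) - (-np.log(xs))

# Matrix version on one sample with d much larger than n.
rng = np.random.default_rng(7)
n, d = 5, 500
X = rng.standard_normal((n, d))
M = X @ X.T / d
lam_min = float(np.linalg.eigvalsh(M).min())
lhs = neg_logdet(M)
rhs = second_order_bound(M)
```

When `lam_min` is at least $1/2$, the value `lhs` is bounded by `rhs`, which is the deterministic inequality behind (13).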


5 Proof of (9) 

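Recall that $Z \in \mathbb{R}^d$ is a random variable with a twice continuously differentiable density $w$ such
that $\int \frac{|\nabla w|^2}{w} < \infty$ and $\int \|\nabla^2 w\| < \infty$, $h$ is the density of $AZ \in \mathbb{R}^n$ (with $AA^\top = \mathrm{I}_n$), and also we
fix $e \in \mathbb{R}^n$ and a sufficiently smooth map^6 $p : \mathbb{R}^d \to \mathbb{R}^d$ with $Ap(x) = e$, $\forall x \in \mathbb{R}^d$. We want to
prove:

$$e^\top I(h) e \le \int \left( \mathrm{Tr}(Dp^2) + p^\top \nabla^2(-\log w) \, p \right) w. \tag{15}$$

First we rewrite the right hand side in (15) as follows:

$$\int \left( \mathrm{Tr}(Dp^2) + p^\top \nabla^2(-\log w) \, p \right) w = \int \frac{(\nabla \cdot (pw))^2}{w}.$$

The above identity is a straightforward calculation (with several applications of the one-dimensional
integration by parts, which are justified by the assumptions on $p$ and $w$), see Ball et al. [2003] for
more details. Now we rewrite the left hand side of (15). Using the notation $g_x$ for the partial
derivative of a function $g$ in the direction $x$, we have

$$e^\top I(h) e = \int \frac{h_e^2}{h}.$$

Next observe that for any $x \in \mathbb{R}^n$ one can write $h(x) = \int_{A^\top x + E^\perp} w$ where $E$ is the $n$-dimensional
subspace generated by the orthonormal rows of $A$, and thus thanks to the assumptions
on $w$ one has:

$$h_e(x) = \int_{A^\top x + E^\perp} w_{A^\top e} = \int_{A^\top x + E^\perp} \nabla \cdot ((A^\top e) w).$$

The key step is now to remark that the condition $\forall x, Ap(x) = e$ exactly means that the projection
of $p$ on $E$ is $A^\top e$, and thus by the Divergence Theorem one has

$$\int_{A^\top x + E^\perp} \nabla \cdot ((A^\top e) w) = \int_{A^\top x + E^\perp} \nabla \cdot (pw).$$

The proof is concluded with a simple Cauchy-Schwarz inequality:

$$e^\top I(h) e = \int_{\mathbb{R}^n} \frac{\left( \int_{A^\top x + E^\perp} \nabla \cdot (pw) \right)^2}{\int_{A^\top x + E^\perp} w} \, dx \le \int_{\mathbb{R}^n} \int_{A^\top x + E^\perp} \frac{(\nabla \cdot (pw))^2}{w} \, dx = \int_{\mathbb{R}^d} \frac{(\nabla \cdot (pw))^2}{w}.$$

^6 For instance it is enough that $p$ is twice continuously differentiable, and that the coordinate functions $p_i$ and their
derivatives $\frac{\partial p_i}{\partial x_i}$, $\frac{\partial p_i}{\partial x_j}$, $\frac{\partial^2 p_i}{\partial x_i \partial x_j}$ are bounded.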

6 Open problems 

This work leaves many questions open. A basic question is whether one could get away with
a weaker independence assumption on the matrix $\mathbb{X}$. Indeed several of the estimates in Section 3 and
Section 4 would work under the assumption that the rows (or the columns) of $\mathbb{X}$ are i.i.d. from a
log-concave distribution in $\mathbb{R}^d$ (or $\mathbb{R}^n$). However it seems that the core of the proof, namely the
induction argument from Section 2, breaks without the independence assumption for the entries of
$\mathbb{X}$. Thus it remains open whether Theorem 1 is true with only row (or column) independence for
$\mathbb{X}$. We note that the case of row independence is probably much harder than column independence.

As we observed in Section 1.2, a natural alternative route to prove Theorem 1 (or possibly a
variant of it with a different metric) would be to use Stein’s method. A straightforward application
of existing results yields the suboptimal dimension dependency $d \gg n^6$ for convergence, and it is
an intriguing open problem whether the optimal rate $d \gg n^3$ can be obtained with Stein’s method.

In this paper we consider Wishart matrices with zeroed out diagonal elements in order to avoid
further technical difficulties (also for many applications -such as the random geometric graph
example- the diagonal elements do not contain relevant information). We believe that Theorem
1 remains true with the diagonal included (given an appropriate modification of the Gaussian
ensemble). The main difficulty is that in the chain rule argument one will have to deal with the law
of the diagonal elements conditionally on the other entries. We leave this to further works, but
we note that when $\mu$ is the standard Gaussian it is easy to conclude the calculations with these
conditional laws.

In Eldan [2015] it is proven that when $\mu$ is a standard Gaussian and $d/n \to +\infty$, one has
$\mathrm{TV}(\mathcal{W}_{n,d}(\mu), \mathcal{W}_{n,d+1}(\mu)) \to 0$. It seems conceivable that the techniques developed in this paper
could be useful to prove such a result for a more general class of distributions $\mu$. However a major
obstacle is that the tools from Section 3 are strongly tied to measuring the relative entropy with
respect to a standard Gaussian (because it maximizes the entropy), and it is not clear at all how to
adapt this part of the proof.

Finally one may be interested in understanding CLTs of the form (1) for higher-order
interactions. More precisely recall that by denoting $\mathbb{X}_i$ for the $i^{th}$ column of $\mathbb{X}$ one can write $\mathbb{X}\mathbb{X}^\top =
\sum_{i=1}^d \mathbb{X}_i \otimes \mathbb{X}_i = \sum_{i=1}^d \mathbb{X}_i^{\otimes 2}$. For $p \in \mathbb{N}$ we may now consider the distribution $\mathcal{W}^{(p)}_{n,d}$ of $\frac{1}{\sqrt{d}} \sum_{i=1}^d \mathbb{X}_i^{\otimes p}$
(for sake of consistency we should remove the non-principal terms in this tensor). The measure
$\mathcal{W}^{(p)}_{n,d}$ has recently gained interest in the machine learning community, see Anandkumar et al.
[2014]. It would be interesting to see if the method described in this paper can be used to
understand how large $d$ needs to be as a function of $n$ and $p$ so that $\mathcal{W}^{(p)}_{n,d}$ is close to being a Gaussian
distribution.







Acknowledgements 

The authors thank Assaf Naor for some useful discussions. This work was completed while S.G. 
was an intern at Microsoft Research in Redmond. He thanks the Theory group for its hospitality. 


References 


Radoslaw Adamczak, Alexander Litvak, Alain Pajor, and Nicole Tomczak-Jaegermann. Quantitative
estimates of the convergence of the empirical covariance matrix in log-concave ensembles.
Journal of the American Mathematical Society, 23(2):535-561, 2010.

Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgarsky. Tensor 
decompositions for learning latent variable models. Journal of Machine Learning Research, 15: 
2773-2832, 2014. 

Shiri Artstein, Keith M Ball, Franck Barthe, and Assaf Naor. On the rate of convergence in the
entropic central limit theorem. Probability Theory and Related Fields, 129(3):381-390, 2004.

Guillaume Aubrun, Stanislaw J Szarek, and Deping Ye. Entanglement thresholds for random 
induced states. Communications on Pure and Applied Mathematics, 67(1):129-171, 2014. 

Keith Ball and Van Hoang Nguyen. Entropy jumps for log-concave isotropic random vectors and
spectral gap. Studia Math., 213(1):81-96, 2012.

Keith Ball, Franck Barthe, and Assaf Naor. Entropy jumps in the presence of a spectral gap. Duke
Mathematical Journal, 119(1):41-63, 2003.

Sergey Bobkov. Isoperimetric and analytic inequalities for log-concave probability measures.
Annals of Probability, 27(4):1903-1921, 1999.

Sebastien Bubeck, Jian Ding, Ronen Eldan, and Miklos Z. Racz. Testing for high-dimensional 
geometry in random graphs. arXiv preprint arXiv:1411.5713, 2014. To appear in Random 
Structures and Algorithms. 

Sourav Chatterjee. A short survey of Stein’s method. arXiv preprint arXiv:1404.1392, 2014.

Sourav Chatterjee and Elizabeth Meckes. Multivariate normal approximation using exchangeable 
pairs. arXiv preprint math/0701464, 2007. 

Thomas M. Cover and Joy A. Thomas. Elements of information theory. Wiley-Interscience, 1991. 

Alan Edelman. Eigenvalues and condition numbers of random matrices. SIAM Journal on Matrix 
Analysis and Applications, 9(4):543-560, 1988. 

Ronen Eldan. An efficiency upper bound for inverse covariance estimation. Israel Journal of
Mathematics, 207(1):1-9, 2015.

Tiefeng Jiang and Danning Li. Approximation of rectangular beta-Laguerre ensembles and large
deviations. Journal of Theoretical Probability, pages 1-44, 2013.

Iain M. Johnstone. On the distribution of the largest eigenvalue in principal components analysis. 
The Annals of Statistics, 29(2):295-327, 2001. 

Grigoris Paouris. Small ball probability estimates for log-concave measures. Transactions of the
American Mathematical Society, 364(1):287-308, 2012.

Mark Rudelson and Roman Vershynin. The Littlewood-Offord problem and invertibility of random
matrices. Advances in Mathematics, 218(2):600-633, 2008.

Mark Rudelson and Roman Vershynin. Non-asymptotic theory of random matrices: extreme
singular values. arXiv preprint arXiv:1003.2990, 2010.

Arvind Sankar, Daniel A Spielman, and Shang-Hua Teng. Smoothed analysis of the condition 
numbers and growth factors of matrices. SIAM Journal on Matrix Analysis and Applications, 28 
(2):446-476, 2006. 

Charles Stein. Approximate computation of expectations. Lecture Notes-Monograph Series, 7:
i-164, 1986.